Project Title: Netflix Data Analysis with Python

Netflix is one of the largest providers of online streaming services, and its massive subscriber base generates vast amounts of data. In this notebook, I walk through a data science project that analyzes Netflix data using Python.

Introduction to Netflix Data Analysis

Netflix has continually adapted its business model to meet evolving market demands, transitioning from on-demand DVD rentals to becoming a major producer of original content. This shift has generated a wealth of data that can be analyzed to glean insights into Netflix’s content strategy and user preferences.

In this project, I will explore several key aspects of Netflix's data to understand what drives its business. Key areas of analysis include:

Content availability: Understanding what content is available on Netflix.

Content similarity: Analyzing the similarities between different content.

Network analysis: Examining the relationships between actors and directors.

Business focus: Identifying the trends Netflix is focusing on.

Sentiment analysis: Evaluating the sentiment of the content available on Netflix.

Dataset Overview

The dataset I’m using for this Netflix data analysis contains information on TV shows and movies available on Netflix as of 2019. This dataset is provided by Flixable, a third-party search engine for Netflix.

In [2]:
!pip install textblob
In [3]:
import numpy as np
import pandas as pd
import plotly.express as px
from textblob import TextBlob
In [5]:
dff = pd.read_csv(r'C:\Users\User\Desktop\NDA\netflix_titles.csv')
In [6]:
dff.columns
Out[6]:
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')
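
Before working with individual columns, it can help to check how many values are missing in each one, since the cells further down fill gaps in `director` and `cast` with placeholder text. The following is a minimal sketch on a small made-up frame; the rows are illustrative and are not taken from the dataset:

```python
import pandas as pd

# Toy stand-in for dff; all values are made up for illustration
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["Jane Doe", None, "John Roe"],
    "cast": [None, None, "Actor One, Actor Two"],
})

# Number of missing values per column
missing = sample.isna().sum()
print(missing)
```

Running the same `isna().sum()` call on `dff` would show which columns need a placeholder before splitting or grouping.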

Distribution of Content:

To begin the task of analyzing Netflix data, I’ll start by looking at the distribution of content ratings on Netflix:

In [7]:
# Count titles per rating (rows with a missing rating are dropped by groupby)
z = dff.groupby(['rating']).size().reset_index(name='counts')
In [8]:
pieChart = px.pie(z, values='counts', names='rating', 
                  title='Distribution of Content Ratings on Netflix',
                  color_discrete_sequence=px.colors.qualitative.Set3)
pieChart.show()

The chart above shows that “TV-MA” is the most common rating on Netflix, which means that the largest share of the available content is intended for mature, adult audiences.
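
To read the same distribution as numbers rather than from the chart, the share of each rating can be computed directly. This is a small sketch on made-up ratings, so the percentages below say nothing about the real dataset:

```python
import pandas as pd

# Made-up ratings column standing in for dff['rating']
ratings = pd.Series(["TV-MA", "TV-14", "TV-MA", "PG", "TV-MA", "TV-14", "R", "TV-PG"])

# Percentage of titles per rating, largest first
shares = ratings.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

A table like this makes it easy to tell whether the top rating is an absolute majority or simply the largest slice.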

Top 5 Actors and Directors:

Now let’s look at the five directors with the most titles on this platform:

In [9]:
# Label missing directors so they can be filtered out after counting
dff['director']=dff['director'].fillna('No Director Specified')
# One row per director: split the comma-separated names and stack them into one column
# (names after the first keep a leading space because the separator is ',')
filtered_directors=dff['director'].str.split(',',expand=True).stack()
filtered_directors=filtered_directors.to_frame()
filtered_directors.columns=['Director']
# Count titles per director and drop the placeholder
directors=filtered_directors.groupby(['Director']).size().reset_index(name='Total Content')
directors=directors[directors.Director !='No Director Specified']
directors=directors.sort_values(by=['Total Content'],ascending=False)
# Keep the five directors with the most titles, re-sorted ascending for plotting
directorsTop5=directors.head()
directorsTop5=directorsTop5.sort_values(by=['Total Content'])
fig1=px.bar(directorsTop5,x='Total Content',y='Director',title='Top 5 Directors on Netflix')
fig1.show()

The graph above shows that the top 5 directors on this platform are:

Raul Campos, Jan Suter, Jay Karas, Marcus Raboy, Jay Chapman
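
One caveat about the counting step: splitting on ',' alone leaves a leading space on every name after the first, so the same person can end up counted under two slightly different strings. A possible variant that strips the whitespace is sketched below on made-up names. It is not the code that produced the list above, and the ranking on the real data may differ:

```python
import pandas as pd

# Made-up director column; None stands for a missing value
sample = pd.DataFrame({
    "director": ["Ann Lee, Bob Ray", "Bob Ray", None, "Cy Fox, Ann Lee", "Bob Ray, Cy Fox"],
})

names = (
    sample["director"]
    .dropna()            # skip titles without a director
    .str.split(",")      # list of names per title
    .explode()           # one row per name
    .str.strip()         # remove the space left after each comma
)

# Titles per director, largest first
top = names.value_counts().head(5)
print(top)
```

The same pattern would apply to the `cast` column.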

Now let’s have a look at the five actors who appear in the most titles on this platform:

In [11]:
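# Label missing cast lists so they can be filtered out after counting
dff['cast']=dff['cast'].fillna('No Cast Specified')
# One row per actor: split the comma-separated names and stack them into one column
# (names after the first keep a leading space because the separator is ',')
filtered_cast=dff['cast'].str.split(',',expand=True).stack()
filtered_cast=filtered_cast.to_frame()
filtered_cast.columns=['Actor']
# Count titles per actor and drop the placeholder
actors=filtered_cast.groupby(['Actor']).size().reset_index(name='Total Content')
actors=actors[actors.Actor !='No Cast Specified']
actors=actors.sort_values(by=['Total Content'],ascending=False)
# Keep the five actors with the most titles, re-sorted ascending for plotting
actorsTop5=actors.head()
actorsTop5=actorsTop5.sort_values(by=['Total Content'])
fig2=px.bar(actorsTop5,x='Total Content',y='Actor', title='Top 5 Actors on Netflix')
fig2.show()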

The plot above shows that the top 5 actors on Netflix are:

Anupam Kher, Om Puri, Shah Rukh Khan, Takahiro Sakurai, Boman Irani

Analyzing Content on Netflix:

The next thing to analyze from this data is the trend of production over the years on Netflix:

In [12]:
df1=dff[['type','release_year']]
df1=df1.rename(columns={"release_year": "Release Year"})
df2=df1.groupby(['Release Year','type']).size().reset_index(name='Total Content')
df2=df2[df2['Release Year']>=2010]
fig3 = px.line(df2, x="Release Year", y="Total Content", color='type',title='Trend of content produced over the years on Netflix')
fig3.show()

The line graph above shows that the number of titles has declined for both movies and TV shows since 2018. Because the dataset reflects the catalog as of 2019, the most recent release years may be incomplete.
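
Note that `release_year` is the year a title was originally released, not the year it arrived on Netflix. If the goal is to see when titles were added to the catalog, the `date_added` column could be used instead. The sketch below assumes that `date_added` holds strings such as "September 9, 2019"; that format and the sample rows are assumptions for illustration:

```python
import pandas as pd

# Made-up rows; the date format is an assumption about date_added
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "date_added": ["September 9, 2019", " January 1, 2018", None, "March 15, 2019"],
})

# Strip stray whitespace, then parse; unparseable or missing values become NaT
added = pd.to_datetime(
    sample["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)
sample["Year Added"] = added.dt.year

# Titles per year added and type, ignoring rows without a date
per_year = (
    sample.dropna(subset=["Year Added"])
    .groupby(["Year Added", "type"])
    .size()
    .reset_index(name="Total Content")
)
print(per_year)
```

The resulting frame has the same shape as `df2`, so it could be passed to `px.line` in the same way.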

Finally, to conclude the analysis, I will analyze the sentiment of the content descriptions on Netflix:

In [13]:
dfx=dff[['release_year','description']]
dfx=dfx.rename(columns={'release_year':'Release Year'})
for index,row in dfx.iterrows():
    text=row['description']
    # TextBlob polarity ranges from -1 (negative) to 1 (positive)
    testimonial=TextBlob(text)
    p=testimonial.sentiment.polarity
    if p==0:
        sent='Neutral'
    elif p>0:
        sent='Positive'
    else:
        sent='Negative'
    # Write the label to the current row only
    dfx.loc[index,'Sentiment']=sent
In [14]:
dfx=dfx.groupby(['Release Year','Sentiment']).size().reset_index(name='Total Content')
In [15]:
dfx=dfx[dfx['Release Year']>=2010]
fig4 = px.bar(dfx, x="Release Year", y="Total Content", color="Sentiment", title="Sentiment of content on Netflix")
fig4.show()

The graph above shows that, in every release year plotted, titles with positive descriptions outnumber those with neutral and negative descriptions combined.
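
The labels depend on how polarity is mapped to a category. The loop above treats only a polarity of exactly 0 as neutral, so a score such as 0.01 already counts as positive. The sketch below shows the same mapping as a small function with an optional tolerance. The `eps` parameter and the sample scores are hypothetical and are not part of the original analysis:

```python
from collections import Counter

def polarity_to_label(p, eps=0.0):
    """Map a polarity score in [-1, 1] to a sentiment label.

    eps is a hypothetical tolerance: scores within +/- eps count as neutral.
    eps=0.0 reproduces the exact p == 0 test used in the notebook loop.
    """
    if p > eps:
        return "Positive"
    if p < -eps:
        return "Negative"
    return "Neutral"

# Made-up polarity scores; in the notebook they come from
# TextBlob(description).sentiment.polarity
scores = [0.5, 0.0, -0.2, 0.05, -0.01]

strict = Counter(polarity_to_label(p) for p in scores)
loose = Counter(polarity_to_label(p, eps=0.1) for p in scores)
print(strict)
print(loose)
```

Comparing the two counts on the real descriptions would indicate how sensitive the conclusion is to the neutral threshold.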
